Scientific Python antipatterns advent calendar day twenty three
For today, a research-software antipattern that often shows up in pipelines and notebooks: recomputing the same expensive thing over and over again. As a reminder, I’ll post one tiny example per day with the intention that they should only take a couple of minutes to read.
If you want to read them all but can’t be bothered checking this website each day, sign up for the mailing list:
and I’ll send a single email at the end with links to them all.
Caching repeated computations
A lot of our code in scientific research involves slow computations on data. These can be processes that involve reading a lot of data from a disk, or that involve API calls to other computers, or simply computationally intensive things like parameter optimisation, maximum likelihood calculations, etc.
For a toy example, let’s make a function that is deliberately slow by throwing in a sleep:
import time

def slow_function(word):
    time.sleep(1)
    return word.lower().count('a')
We can check the output on a few data points:
fruits = ['apple', 'banana', 'grapefruit', 'apple', 'mango', 'grapefruit']
for fruit in fruits:
    print(slow_function(fruit))
1
3
1
1
1
1
and if we are running in a Jupyter notebook, we can easily time how long it takes:
%%time
for fruit in fruits:
    print(slow_function(fruit))
1
3
1
1
1
1
CPU times: user 11.2 ms, sys: 532 μs, total: 11.7 ms
Wall time: 6 s
For this trivial example, it’s easy to see that we are wasting a bunch of time by doing the same calculation multiple times. Our analysis function always returns the same output for the same input, so the second time we count the number of a’s in 'apple', we know that we’re going to get the same answer as the first time.
There are a couple of ways to try to speed this up. If we create a dictionary and store the result every time we run the function, then we can re-use previously calculated values:
%%time
results = {}
for fruit in fruits:
    if fruit in results:
        print(results[fruit])
    else:
        answer = slow_function(fruit)
        print(answer)
        results[fruit] = answer
1
3
1
1
1
1
CPU times: user 4.58 ms, sys: 3.98 ms, total: 8.56 ms
Wall time: 4 s
This does a good job of speeding up the code, but adds a bunch of complicated logic that is not really related to the core problem. We can make the code slightly cleaner by simply pre-calculating the answers for all unique values:
%%time
results = {}
for fruit in set(fruits):
    results[fruit] = slow_function(fruit)
for fruit in fruits:
    print(results[fruit])
1
3
1
1
1
1
CPU times: user 1.51 ms, sys: 1.96 ms, total: 3.47 ms
Wall time: 4 s
This is just as fast as the previous version, and has the nice property that it separates out the dictionary logic from the iteration logic.
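As a side note, the pre-calculation step can be collapsed into a single dictionary comprehension. Here’s a minimal sketch of that variant (with a shortened sleep so it runs quickly; the names match the example above):

```python
import time

def slow_function(word):
    time.sleep(0.1)  # shortened sleep for this sketch
    return word.lower().count('a')

fruits = ['apple', 'banana', 'grapefruit', 'apple', 'mango', 'grapefruit']

# Build the lookup table in one expression over the unique values,
# then iterate over the full list cheaply.
results = {fruit: slow_function(fruit) for fruit in set(fruits)}
counts = [results[fruit] for fruit in fruits]
print(counts)  # [1, 3, 1, 1, 1, 1]
```

This pays the slow cost once per unique input, exactly like the loop version, but keeps the table-building to a single line.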
The cleanest solution, however, is to modify the function and keep the loop code as simple as possible. We can do this manually:
results = {}

def slow_function(word):
    if word in results:
        return results[word]
    time.sleep(1)
    answer = word.lower().count('a')
    results[word] = answer
    return answer
This speeds up the calling code without requiring any modification:
%%time
for fruit in fruits:
    print(slow_function(fruit))
1
3
1
1
1
1
CPU times: user 5.48 ms, sys: 1.28 ms, total: 6.76 ms
Wall time: 4 s
but it makes the function code less clear, and it’s easy to see that we might run into problems if we have multiple functions defined, each of which would need its own dictionary to store the results.
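One way around the shared-dictionary problem is to write a small decorator that gives each function its own private cache. This is a sketch of the idea (the `memoize` name is my own, and the sleep is shortened), essentially a hand-rolled version of what the standard library provides:

```python
import time
import functools

def memoize(func):
    cache = {}  # each decorated function closes over its own dict

    @functools.wraps(func)
    def wrapper(word):
        if word not in cache:
            cache[word] = func(word)  # only pay the slow cost on a miss
        return cache[word]

    return wrapper

@memoize
def slow_function(word):
    time.sleep(0.1)  # shortened sleep for this sketch
    return word.lower().count('a')

print(slow_function('banana'))  # 3, computed slowly
print(slow_function('banana'))  # 3, returned instantly from the cache
```

Because `cache` lives in the closure of each `wrapper`, two decorated functions never share state.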
As usual, the standard library offers the nicest way to do this. The lru_cache decorator from functools adds this behaviour (which we call caching) to any function with a single annotation line:
from functools import lru_cache

@lru_cache
def slow_function(word):
    time.sleep(1)
    return word.lower().count('a')
This gives us a solution that’s efficient without requiring any change to either the function code or the calling code:
%%time
for fruit in fruits:
    print(slow_function(fruit))
1
3
1
1
1
1
CPU times: user 7.27 ms, sys: 1.93 ms, total: 9.2 ms
Wall time: 4 s
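A bonus not covered above: a function wrapped with lru_cache also gains cache_info() and cache_clear() methods, which are handy for checking that the cache is actually being hit and for invalidating it when the underlying data changes. A small sketch (again with a shortened sleep):

```python
import time
from functools import lru_cache

@lru_cache
def slow_function(word):
    time.sleep(0.1)  # shortened sleep for this sketch
    return word.lower().count('a')

for fruit in ['apple', 'banana', 'apple']:
    slow_function(fruit)

# hits=1 (the repeated 'apple'), misses=2 (the first 'apple' and 'banana')
print(slow_function.cache_info())

slow_function.cache_clear()  # empty the cache, e.g. if the input files changed
print(slow_function.cache_info().currsize)  # 0
```

One caveat: lru_cache stores results keyed on the arguments, so every argument must be hashable; passing a list or a dict to a cached function raises a TypeError.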
One more time; if you want to see the rest of these little write-ups, sign up for the mailing list: